1. INTRODUCTION

Deoxyribonucleic acid (DNA) carries the genetic information of humans and practically all other
organisms. An individual's DNA is found in practically all of their cells. DNA located inside the cell
nucleus is called nuclear DNA. A small quantity of DNA is also found in the mitochondria and is called
mitochondrial DNA, or mtDNA [1]-[7].

The code that stores an individual's information in DNA is made up of four chemical bases:
adenine (A), guanine (G), cytosine (C), and thymine (T). The order, or sequence, of these bases determines
the information responsible for forming the organism, much as the letters of the alphabet appear in a
specific order to create words and sentences. Over 99% of the 3 billion bases are identical in every person.
The remaining fraction is valuable because it differs between individuals, so it can be used for
recognition. The sequence of the chemical bases in DNA is therefore a crucial characteristic, as it is
exploited as a blueprint for recognizing an individual. The blueprint for replicating or sequencing the
bases is found in the double helix DNA strands [1]-[7].

Two long strands of nucleotides combine to form the double helix spiral that is DNA.
A nucleotide is composed of a base, a sugar and a phosphate. DNA nucleotides form units called base
pairs, in which A pairs with T and C pairs with G. Each base also has a phosphate and a sugar molecule
attached to it. In the double helix structure, the base pairs act as the rungs of a ladder, and the sugar
and phosphate molecules act as its vertical side rails. Figure 1 provides a DNA demonstration [1]-[7].
Each cell contains 46 long structures called chromosomes, which carry the DNA instructions. These
chromosomes [1]-[7] are made up of many smaller pieces of DNA, called genes. Chromosomes have two
sources, which are the parents. That is, one chromosome comes from the mother and another chromosome
comes from the father.

The main aim and contribution of this paper is a novel deep learning model, named the special DNA
deep learning (SDDL). This model is employed for identifying the DNA of individuals based on their
chromosomes. Following the introduction, the paper is structured as follows: section 2 reviews prior
work, section 3 presents the proposed SDDL model, section 4 reports and discusses the results, and
section 5 concludes the paper.
Figure 1. A DNA demonstration for the location and shape


2. PRIOR WORK 

Minaee et al. [1] presented an overview of DNA as a biometric identifier. Comprehensive surveys and
reviews of different types of biometrics are presented in [2]-[7]. The objective of the research in [8]
was to develop an algorithm that could analyze the DNA sequence of a systemic lupus erythematosus (SLE)
patient and predict the killer T-cell responses. Gene variation in a patient was identified by comparing
a gene's DNA coding sequence in the reference genome with the same gene's coding sequence in the
patient's genome. The threshold for significant single nucleotide polymorphisms was 0.1% of the length of
the DNA sequence that codes for a gene. The matching was done using the suggested approximation sequence
matching method. The findings indicated that each of the 16 subjects would have autoimmune killer
T-cells. Additionally, the algorithm's accuracy and predictive power both reached 80%.

DNA was used to identify individuals in [9], where an efficient method was used to find distinctive DNA
patterns, termed unique personal DNA patterns (UPDP). Four datasets were used in that work: DNA
classification (DC), DNA sequences (DS), sample DNA sequence (SDS), and human DNA sequences (HDS). The
false acceptance rates (FARs) achieved for the DC, SDS, HDS, and DS datasets were 2.07%, 1.41%, 0.26%,
and 0.75%, respectively, whereas the false rejection rates (FRRs) for all four datasets were recorded
as 0%. Two DNA sequencing methods were evaluated in [10], namely the Rabin-Karp (R-K) and maximum common
substream (MCS) algorithms. Different code implementations and methods were used to evaluate these two
approaches, and accuracy and performance were the two parameters used to assess the work. The goal of
the study by Afolabi and Akintaro [11] was to present a summary of current technological advancements in
the area of biometric security, with a focus on the effects that the use of DNA-based biometric systems
has on both human lives and cyber security. It additionally aimed at creating a DNA-based biometric
system for identifying people, in order to address the level of precision at which current technologies
are insufficient for a system of universal identity (ID). Signal processing was utilized to condense DNA
sequences in [12], while forensic DNA typing was discussed in [13], [14]. Further recent studies cover
different DNA applications [15]-[21].

From this review, it can be noticed that a special machine learning (ML) model that is able to identify
an individual together with his/her parents would be valuable. This study focuses on proposing a novel
deep learning approach that can provide such a facility.

3. PROPOSED METHOD

In this paper, a new deep learning model, or network, termed the SDDL is created. This network has
five layers: the input layer (chromosome layer), the 1st hidden layer (distance layer), the 2nd hidden
layer (impulse response (IR) layer), the 3rd hidden layer (concatenation layer) and the output layer
(decision layer). The first two hidden layers form the feature extraction part, whereas the last two
layers form the classifier part. The infrastructure of the novel SDDL model is given in Figure 2.
Figure 2. Infrastructure of the novel SDDL model


The input layer accepts two input vectors ($X_1$ and $X_2$). The first vector, $X_1$, represents input
values from the chromosome of a mother. The second vector, $X_2$, represents input values from the
chromosome of a father. The 1st hidden layer calculates the Euclidean distance between the inputs and
the weights for each chromosome as in (1) and (2):


$$Z_{i1,j1} = \lVert X_1 - W_{i1,j1} \rVert, \quad i1 = 1,2,\ldots,n1, \quad j1 = 1,2,\ldots,m1 \tag{1}$$
$$Z_{i2,j2} = \lVert X_2 - W_{i2,j2} \rVert, \quad i2 = 1,2,\ldots,n2, \quad j2 = 1,2,\ldots,m2 \tag{2}$$


where $Z_{i1,j1}$ represents a node value of the 1st hidden layer for the chromosome of a mother,
$W_{i1,j1}$ represents the weight vector of the 1st hidden layer for the chromosome of the mother, $n1$
represents the number of chromosome values for the mother, $m1$ represents the number of chromosome
training vectors for the mother, $Z_{i2,j2}$ represents a node value of the 1st hidden layer for the
chromosome of a father, $W_{i2,j2}$ represents the weight vector of the 1st hidden layer for the
chromosome of the father, $n2$ represents the number of chromosome values for the father and $m2$
represents the number of chromosome training vectors for the father. The 2nd hidden layer computes an IR
($\delta_{j1}$ or $\delta_{j2}$) of the calculated Euclidean distance, where $\delta_{j1}$ represents an
IR for the chromosome of the mother and $\delta_{j2}$ represents an IR for the chromosome of the father.
A demonstration of the employed IR function in the 2nd hidden layer is shown in Figure 3.
Figure 3. A demonstration of the employed IR function in the 2nd hidden layer


In fact, a Euclidean distance outcome of either 0 or 1 is accepted, as these values are within the
tolerance that is acceptable for the employed real Iraqi datasets. The equations of the IR function
can be written as (3) and (4):


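$$\delta_{j1} = \begin{cases} 1 & \text{if } Z_{i1,j1} = 0 \text{ or } 1 \\ 0 & \text{otherwise} \end{cases} \tag{3}$$
$$\delta_{j2} = \begin{cases} 1 & \text{if } Z_{i2,j2} = 0 \text{ or } 1 \\ 0 & \text{otherwise} \end{cases} \tag{4}$$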
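
The first two hidden layers can be sketched as follows. This is a minimal illustration rather than the authors' implementation. It assumes that each node value is the absolute difference between one chromosome value and the matching stored weight, and that the IR for a stored training vector is 1 only when every one of its node values is 0 or 1. The function names `distance_layer` and `impulse_response` and all sample values are hypothetical.

```python
def distance_layer(x, weights):
    """Return Z[j][i] = |x[i] - weights[j][i]| for every stored vector j."""
    return [[abs(xi - wi) for xi, wi in zip(x, w)] for w in weights]


def impulse_response(z_rows):
    """Return 1 for a stored vector whose distances are all 0 or 1, else 0.

    Assumption: the per-value distances are combined with a logical AND.
    """
    return [1 if all(z in (0, 1) for z in row) else 0 for row in z_rows]


# Hypothetical stored training vectors (weights) for the mother's chromosome.
mother_weights = [
    [12, 15, 9, 21],  # stored vector 0
    [11, 17, 9, 30],  # stored vector 1
]
x1 = [12, 16, 9, 21]  # query values, within the tolerance of stored vector 0

z1 = distance_layer(x1, mother_weights)
delta1 = impulse_response(z1)
print(z1)      # [[0, 1, 0, 0], [1, 1, 0, 9]]
print(delta1)  # [1, 0]
```

The same two functions would be applied to the father's chromosome with its own stored vectors.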


Hence, the 3rd layer concatenates the values of $\delta_{j1}$ and $\delta_{j2}$. As mentioned,
$\delta_{j1}$ is for the mother's chromosome and $\delta_{j2}$ is for the father's chromosome. The 3rd
layer's equation can be expressed as (5):


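$$C_k = \begin{cases} 11 & \text{if } \delta_{j1} = 1 \text{ and } \delta_{j2} = 1 \\ 10 & \text{if } \delta_{j1} = 1 \text{ and } \delta_{j2} = 0 \\ 01 & \text{if } \delta_{j1} = 0 \text{ and } \delta_{j2} = 1 \end{cases}, \quad k = 1,2,\ldots,q \tag{5}$$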


where $C_k$ represents a node value of the 3rd hidden layer for both chromosomes and $q$ represents the
number of chromosome values for the mother ($n1$) or father ($n2$), as $n1 = n2$.

Consequently, the output layer produces the decision outcomes. Each decision outcome is
represented by the required identification value. It can be computed according to (6):


$$D_k = \begin{cases} 2 & \text{if } C_k = 11 \\ 1 & \text{if } C_k = 10 \\ -1 & \text{if } C_k = 01 \end{cases}, \quad k = 1,2,\ldots,q \tag{6}$$


where $D_k$ represents the identification value of the output layer.
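
The concatenation and decision layers amount to a small lookup, sketched below under stated assumptions. The decision values 2, 1 and -1 for the codes 11, 10 and 01 are this sketch's reading of (6). The code 00 is not covered by (5) or (6), so the sketch returns `None` for it. The names `concatenate`, `decide` and `DECISION` and the sample IR values are hypothetical.

```python
# Decision value per concatenated code, as read from (6) (an assumption).
DECISION = {"11": 2, "10": 1, "01": -1}


def concatenate(delta_mother, delta_father):
    """Join the two IR values into a two-character code, as in (5)."""
    return f"{delta_mother}{delta_father}"


def decide(code):
    """Map a code to its identification value; None if the code is not covered."""
    return DECISION.get(code)


# Hypothetical IR outputs for three stored individuals.
delta1 = [1, 1, 0]  # from the mother's chromosome
delta2 = [1, 0, 1]  # from the father's chromosome

codes = [concatenate(a, b) for a, b in zip(delta1, delta2)]
outcomes = [decide(c) for c in codes]
print(codes)     # ['11', '10', '01']
print(outcomes)  # [2, 1, -1]
```

Under this reading, the first outcome corresponds to both parents matching, the second to the mother only and the third to the father only.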

It is worth mentioning that the training weights are initialized directly from the input training
vectors. The same idea of initializing the training weights is explained in [22]. This yields important
advantages for the proposed SDDL model: i) it is flexible, as it allows hidden and output nodes to be
added and removed; ii) it does not require iterations during the training stage, so its training is
fast; iii) it does not suffer from the problem of local error in training, unlike other deep learning
models that use the backpropagation training algorithm; and iv) it is able to identify one or both
parents of an individual.
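
The idea of weights that are copied from training vectors can be illustrated with a small container. This is a sketch under assumptions, not the SDDL implementation: the class `StoredVectorLayer`, its methods and the labels are hypothetical, and it only shows why adding or removing a node needs no iterative training.

```python
class StoredVectorLayer:
    """Hypothetical container whose weights are copies of training vectors."""

    def __init__(self):
        self.weights = {}  # node label -> stored training vector

    def add_node(self, label, vector):
        """'Train' one node by copying the vector; no iterations are needed."""
        self.weights[label] = list(vector)

    def remove_node(self, label):
        """Drop one node without touching the remaining weights."""
        self.weights.pop(label, None)


layer = StoredVectorLayer()
layer.add_node("person_a", [12, 15, 9])
layer.add_node("person_b", [11, 17, 9])
layer.remove_node("person_a")
print(sorted(layer.weights))  # ['person_b']
```

Because each node depends only on its own stored vector, enlarging or reducing the layer leaves every other node unchanged.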


4. RESULTS AND DISCUSSION 
4.1. Datasets descriptions 

Two real datasets from Iraq are employed in this study. The first is named the Real Iraqi Dataset
for Kurd (RIDK) and the second is called the Real Iraqi Dataset for Arab (RIDA). Each individual has a
chromosome of 30 values: 15 values from the mother and 15 values from the father. The RIDK dataset has
chromosome values for 52 persons, whereas the RIDA dataset has chromosome values for 200 persons.

As a tolerance of +1 for each value in a chromosome is accepted in practice by Iraqi forensic
medicine, it has been employed to establish the training data. That is, both datasets are augmented by
applying +1 to each value in a chromosome. As such, 1560 training vectors are produced for the RIDK
dataset and 6000 training vectors are generated for the RIDA dataset. The real values of both datasets,
on the other hand, are used in the testing phase. That is, 52 testing vectors are applied for the RIDK
dataset and 200 testing vectors are utilized for the RIDA dataset.
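
One reading of this augmentation that is consistent with the reported counts is sketched below. It is an assumption, not a description confirmed by the text: each of the 30 chromosome values is incremented by 1 in turn, giving 30 training vectors per person. The function name `augment_plus_one` and the sample chromosome are hypothetical.

```python
def augment_plus_one(chromosome):
    """Return one copy per position, each with +1 applied to that position only."""
    augmented = []
    for i in range(len(chromosome)):
        copy = list(chromosome)
        copy[i] += 1
        augmented.append(copy)
    return augmented


# Hypothetical 30-value chromosome (15 values per parent); the values are made up.
person = list(range(10, 40))
training = augment_plus_one(person)

print(len(training))        # 30
print(52 * len(training))   # 1560, the RIDK training count
print(200 * len(training))  # 6000, the RIDA training count
```
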

4.2. SDDL performances

To evaluate the SDDL, its accuracies and times are measured on the full RIDK and RIDA datasets.
Table 1 shows the SDDL accuracies and times for both datasets.


Table 1. SDDL performances of accuracies and times for the employed datasets 


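| Employed datasets          | RIDK dataset | RIDA dataset |
|----------------------------|--------------|--------------|
| Number of training samples | 1560         | 6000         |
| Number of testing samples  | 52           | 200          |
| Training time              | 1.24 Sec.    | 6.19 Sec.    |
| Testing time               | 0.10 Sec.    | 1.08 Sec.    |
| Accuracy                   | 100%         | 100%         |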


The results show that the proposed system achieves an accuracy of 100% on both utilized datasets.
In addition, the proposed SDDL model shows very short training and testing times: training takes
1.24 Sec. for the RIDK dataset and 6.19 Sec. for the RIDA dataset. This fast training is a notable
outcome of the proposed approach.

To validate the SDDL, it is compared with other deep neural networks on the same training and
testing samples used in Table 1; the comparisons are detailed in Table 2. The suggested algorithm
outperformed the previous deep learning models of the stacked autoencoder (SA) [23], deep autoencoder
network (DAN) [24], and autoencoder deep learning (ADL) [25] in terms of flexibility, training time,
mean square error (MSE) and the ability to recognize parents. Regarding flexibility, the SDDL can be
enlarged or reduced without requiring re-training, whereas the other compared deep learning networks
have specific parameters, such as the numbers of hidden layers and neurons, that must be determined.
For training time, the SDDL approach recorded the lowest training times of all compared models on both
employed datasets, which can be considered a significant advantage. For testing time, the proposed SDDL
took longer than the other deep learning models. However, its testing time is still small and
acceptable, especially for the RIDK dataset, and if a single input is considered, the testing time has
no significant effect. The proposed SDDL is also the only model that achieves the lowest MSE, with a
value of 0, on both employed datasets. In addition, the SDDL approach is able to recognize the parent or
parents of identified persons, an ability that none of the other compared deep learning models
provides.
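
For reference, the two reported metrics can be computed as in the sketch below. How the MSE was calculated for the compared models is not stated in the text, so this only illustrates the standard definitions on hypothetical identification values: when every predicted value equals the expected one, the accuracy is 100% and the MSE is 0.

```python
def accuracy(predicted, expected):
    """Percentage of predictions that equal the expected values."""
    matches = sum(p == e for p, e in zip(predicted, expected))
    return 100.0 * matches / len(expected)


def mean_square_error(predicted, expected):
    """Average of the squared differences between predictions and targets."""
    return sum((p - e) ** 2 for p, e in zip(predicted, expected)) / len(expected)


# Hypothetical identification values for four test samples.
expected = [2, 2, 1, -1]
predicted = [2, 2, 1, -1]

print(accuracy(predicted, expected))           # 100.0
print(mean_square_error(predicted, expected))  # 0.0
```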


Table 2. Comparisons between the proposed novel SDDL and other deep neural models (for the same 
training and testing samples used in Table 1) 


| Deep learning model | Parameters | Error and time for the RIDK dataset | Error and time for the RIDA dataset | Ability to recognize parents |
|---|---|---|---|---|
| SA [23] | NoHL=3; NoHN in the 1st HL=25; NoHN in the 2nd HL=20; NoHN in the 3rd HL=15 | MSE=0.08; TRT=6.59 Sec.; TET=0.04 Sec. | MSE=0.02; TRT=20.33 Sec.; TET=0.01 Sec. | No |
| DAN [24] | NoHL=3; NoHN in the 1st HL=64; NoHN in the 2nd HL=64; NoHN in the 3rd HL=64 | MSE=0.08; TRT=10.99 Sec.; TET=0.01 Sec. | MSE=0.02; TRT=35.67 Sec.; TET=0.02 Sec. | No |
| ADL [25] | NoHL=4; NoHN in the 1st HL=30; NoHN in the 2nd HL=30; NoHN in the 3rd HL=30; NoHN in the 4th HL=30 | MSE=0.08; TRT=8.62 Sec.; TET=0.01 Sec. | MSE=0.02; TRT=25.40 Sec.; TET=0.02 Sec. | No |
| Proposed SDDL | NoHL=3; NoHN in the 1st HL (flexible)=2x no. of training vectors; NoHN in the 2nd HL (flexible)=2x no. of training vectors; NoHN in the 3rd HL (flexible)=no. of training vectors | MSE=0; TRT=1.24 Sec.; TET=0.10 Sec. | MSE=0; TRT=6.19 Sec.; TET=1.08 Sec. | Yes |

Where NoHL is the number of hidden layers, NoHN is the number of hidden nodes, and HL is the hidden 
layer. TRT is the training time and TET is the testing time. 

5. CONCLUSION

In this study, a novel SDDL model is proposed to identify people based on their DNA.
The suggested approach can identify either one parent or both parents of an individual, depending on
the provided chromosomes. The SDDL is flexible, as it can be enlarged or reduced as needed. Its training
phase does not require iterations and does not suffer from the local error, and it is much quicker than
that of the other compared deep learning models. Two real datasets from Iraq, termed the RIDK and RIDA,
are employed. On each of these datasets, the proposed approach achieves the highest accuracy of 100%.
It also gives the lowest MSE, with a value of 0, compared with the other deep learning models.